AITopics

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.75)

Neural Information Processing SystemsNov-20-2025, 22:46:53 GMT

Learning Safe Policies with Expert Guidance

We propose a framework for ensuring safe behavior of a reinforcement learning agent when the reward function may be difficult to specify. In order to do this, we rely on the existence of demonstrations from expert policies, and we provide a theoretical framework for the agent to optimize in the space of rewards consistent with its existing knowledge. We propose two methods to solve the resulting optimization: an exact ellipsoid-based method and a method in the spirit of the follow-the-perturbed-leader algorithm. Our experiments demonstrate the behavior of our algorithm in both discrete and continuous problems. The trained agent safely avoids states with potential negative effects while imitating the behavior of the expert in the other states.

expert guidance, learning safe policy, name change, (3 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.62)

Chrysomallis, Iason, Chalkiadakis, Georgios

Going Beyond Expert Performance via Deep Implicit Imitation Reinforcement Learning

arXiv.org Artificial IntelligenceNov-6-2025

Imitation learning traditionally requires complete state-action demonstrations from optimal or near-optimal experts. These requirements severely limit practical applicability, as many real-world scenarios provide only state observations without corresponding actions and expert performance is often suboptimal. In this paper we introduce a deep implicit imitation reinforcement learning framework that addresses both limitations by combining deep reinforcement learning with implicit imitation learning from observation-only datasets. Our main algorithm, Deep Implicit Imitation Q-Network (DIIQN), employs an action inference mechanism that reconstructs expert actions through online exploration and integrates a dynamic confidence mechanism that adaptively balances expert-guided and self-directed learning. This enables the agent to leverage expert guidance for accelerated training while maintaining capacity to surpass suboptimal expert performance. We further extend our framework with a Heterogeneous Actions DIIQN (HA-DIIQN) algorithm to tackle scenarios where expert and agent possess different action sets, a challenge previously unaddressed in the implicit imitation learning literature. HA-DIIQN introduces an infeasibility detection mechanism and a bridging procedure identifying alternative pathways connecting agent capabilities to expert guidance when direct action replication is impossible. Our experimental results demonstrate that DIIQN achieves up to 130% higher episodic returns compared to standard DQN, while consistently outperforming existing implicit imitation methods that cannot exceed expert performance. In heterogeneous action settings, HA-DIIQN learns up to 64% faster than baselines, leveraging expert datasets unusable by conventional approaches. Extensive parameter sensitivity analysis reveals the framework's robustness across varying dataset sizes and hyperparameter configurations.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

2511.03616

Country: Europe (0.92)

Genre: Research Report > New Finding (0.87)

Industry:

Education (1.00)
Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Artificial IntelligenceOct-7-2025

Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs

Jiang, Zishang, Han, Jinyi, Li, Tingyun, Wang, Xinyi, Jiang, Sihang, Liang, Jiaqing, Dai, Zhaoqian, Ma, Shuguang, Yu, Fei, Xiao, Yanghua

Reinforcement Learning with Verifiable Rewards (RLVR) has become a widely adopted technique for enhancing the reasoning ability of Large Language Models (LLMs). However, the effectiveness of RLVR strongly depends on the capability of base models. This issue arises because it requires the model to have sufficient capability to perform high-quality exploration, which involves both effectiveness and diversity. Unfortunately, existing methods address this issue by imitating expert trajectories, which improve effectiveness but neglect diversity. To address this, we argue that the expert only needs to provide guidance only at critical decision points rather than the entire reasoning path. Based on this insight, we propose MENTOR: Mixed-policy Expert Navigation for Token-level Optimization of Reasoning, a framework that provides expert guidance only at critical decision points to perform effective and diverse exploration in RLVR. Extensive experiments show that MENTOR enables models capture the essence of expert strategies rather than surface imitation, thereby performing high-quality exploration and achieving superior overall performance. Our code is available online.

large language model, machine learning, trajectory, (19 more...)

2510.0414

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Richards, Olivia, Obstein, Keith L., Simaan, Nabil

Exploring Accelerated Skill Acquisition via Tandem Training for Colonoscopy

arXiv.org Artificial IntelligenceJul-1-2025

New endoscopists require a large volume of expert-proctored colonoscopies to attain minimal competency. Developing multi-fingered, synchronized control of a colonoscope requires significant time and exposure to the device. Current training methods inhibit this development by relying on tool hand-off for expert demonstrations. There is a need for colonoscopy training tools that enable in-hand expert guidance in real-time. We present a new concept of a tandem training system that uses a telemanipulated preceptor colonoscope to guide novice users as they perform a colonoscopy. This system is capable of dual-control and can automatically toggle between expert and novice control of a standard colonoscope's angulation control wheels. Preliminary results from a user study with novice and expert users show the effectiveness of this device as a skill acquisition tool. We believe that this device has the potential to accelerate skill acquisition for colonoscopy and, in the future, enable individualized instruction and responsive teaching through bidirectional actuation.

artificial intelligence, control wheel, trainee, (12 more...)

doi: 10.31256/HSMR25.03

2506.24046

Country:

North America > United States (0.31)
Asia > Middle East > UAE (0.15)

Genre: Research Report (0.84)

Industry:

Health & Medicine > Therapeutic Area > Oncology > Colorectal Cancer (1.00)
Health & Medicine > Therapeutic Area > Gastroenterology (1.00)

Technology: Information Technology > Artificial Intelligence > Robots (0.30)

arXiv.org Artificial IntelligenceApr-29-2025

Dynamic Action Interpolation: A Universal Approach for Accelerating Reinforcement Learning with Expert Guidance

Cao, Wenjun

Reinforcement learning (RL) suffers from severe sample inefficiency, especially during early training, requiring extensive environmental interactions to perform competently. Existing methods tend to solve this by incorporating prior knowledge, but introduce significant architectural and implementation complexity. We propose Dynamic Action Interpolation (DAI), a universal yet straightforward framework that interpolates expert and RL actions via a time-varying weight $α(t)$, integrating into any Actor-Critic algorithm with just a few lines of code and without auxiliary networks or additional losses. Our theoretical analysis shows that DAI reshapes state visitation distributions to accelerate value function learning while preserving convergence guarantees. Empirical evaluations across MuJoCo continuous control tasks demonstrate that DAI improves early-stage performance by over 160\% on average and final performance by more than 50\%, with the Humanoid task showing a 4$\times$ improvement early on and a 2$\times$ gain at convergence. These results challenge the assumption that complex architectural modifications are necessary for sample-efficient reinforcement learning.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

2504.18766

Country: Europe (0.28)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Artificial IntelligenceDec-24-2024

Large Language Model guided Deep Reinforcement Learning for Decision Making in Autonomous Driving

Pang, Hao, Wang, Zhenpo, Li, Guoqiang

Deep reinforcement learning (DRL) shows promising potential for autonomous driving decision-making. However, DRL demands extensive computational resources to achieve a qualified policy in complex driving scenarios due to its low learning efficiency. Moreover, leveraging expert guidance from human to enhance DRL performance incurs prohibitively high labor costs, which limits its practical application. In this study, we propose a novel large language model (LLM) guided deep reinforcement learning (LGDRL) framework for addressing the decision-making problem of autonomous vehicles. Within this framework, an LLM-based driving expert is integrated into the DRL to provide intelligent guidance for the learning process of DRL. Subsequently, in order to efficiently utilize the guidance of the LLM expert to enhance the performance of DRL decision-making policies, the learning and interaction process of DRL is enhanced through an innovative expert policy constrained algorithm and a novel LLM-intervened interaction mechanism. Experimental results demonstrate that our method not only achieves superior driving performance with a 90\% task success rate but also significantly improves the learning efficiency and expert guidance utilization efficiency compared to state-of-the-art baseline algorithms. Moreover, the proposed method enables the DRL agent to maintain consistent and reliable performance in the absence of LLM expert guidance. The code and supplementary videos are available at https://bitmobility.github.io/LGDRL/.

large language model, machine learning, reinforcement learning, (20 more...)

2412.18511

Genre: Research Report > New Finding (0.86)

Industry:

Transportation > Ground > Road (1.00)
Automobiles & Trucks (1.00)
Information Technology > Robotics & Automation (0.72)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Neural Information Processing SystemsOct-7-2024, 22:52:53 GMT

Reviews: Learning Safe Policies with Expert Guidance

Learning from demonstrations usually faces an ill-posed problem of inferring the expert reward functions. To facilitate safe learning from demonstrations, the paper formulates a maximin learning problem over a convex reward polytope, in order to guarantee that the worst possible consistent reward would yield a policy that is not much worse than optimal. The assumption is that the reward is linear in known features. The authors proposed two method: (i) ellipsoid method and (ii) follow-the-perturbed leader using separation oracles and a given MDP solver. The experiment is done in a grid world setting, and a modified version of the cart-pole problem.

experiment, expert guidance, learning safe policy, (6 more...)

Industry: Education (0.39)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial IntelligenceSep-4-2023

Hundreds Guide Millions: Adaptive Offline Reinforcement Learning with Expert Guidance

Yang, Qisen, Wang, Shenzhi, Zhang, Qihang, Huang, Gao, Song, Shiji

Offline reinforcement learning (RL) optimizes the policy on a previously collected dataset without any interactions with the environment, yet usually suffers from the distributional shift problem. To mitigate this issue, a typical solution is to impose a policy constraint on a policy improvement objective. However, existing methods generally adopt a ``one-size-fits-all'' practice, i.e., keeping only a single improvement-constraint balance for all the samples in a mini-batch or even the entire offline dataset. In this work, we argue that different samples should be treated with different policy constraint intensities. Based on this idea, a novel plug-in approach named Guided Offline RL (GORL) is proposed. GORL employs a guiding network, along with only a few expert demonstrations, to adaptively determine the relative importance of the policy improvement and policy constraint for every sample. We theoretically prove that the guidance provided by our method is rational and near-optimal. Extensive experiments on various environments suggest that GORL can be easily installed on most offline RL algorithms with statistically significant performance improvements.

adaptive offline reinforcement learning, expert guidance, guide million

2309.01448

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.89)

Neural Information Processing SystemsFeb-14-2020, 20:27:49 GMT

Learning Safe Policies with Expert Guidance

Huang, Jessie, Wu, Fa, Precup, Doina, Cai, Yang

We propose a framework for ensuring safe behavior of a reinforcement learning agent when the reward function may be difficult to specify. In order to do this, we rely on the existence of demonstrations from expert policies, and we provide a theoretical framework for the agent to optimize in the space of rewards consistent with its existing knowledge. We propose two methods to solve the resulting optimization: an exact ellipsoid-based method and a method in the spirit of the "follow-the-perturbed-leader" algorithm. Our experiments demonstrate the behavior of our algorithm in both discrete and continuous problems. The trained agent safely avoids states with potential negative effects while imitating the behavior of the expert in the other states.

algorithm, expert guidance, learning safe policy, (1 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.68)